Learning Large Language Models from 0 to 1, Part 2: Capabilities of the Large Language Model GPT-3

This chapter explores the capabilities of GPT-3.

According to the GPT-3 paper, its performance across tasks is mixed.

GPT-3 paper:
https://arxiv.org/pdf/2005.14165.pdf

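On some tasks, such as language modeling, GPT-3 exceeds the state of the art by a wide margin.
On others, where GPT-3 competes with systems trained on large amounts of labeled data, it lags far behind.

Why is this the case?

GPT-3 was not explicitly trained on any of these tasks; it was trained only as a language model to predict the next word.
Even so, GPT-3 does reasonably well on average across a broad range of NLP tasks.
Because GPT-3 was not trained on any specific task, it did not overfit to one, which means it has a good chance of doing well on many other tasks too.
Moreover, if you want to do well on a particular task (for example, question answering), you should in principle be able to adapt GPT-3 using large amounts of labeled data and exceed the state of the art.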

1. Adaptation

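A language model is a probability distribution over sequences of tokens, so it can score any sequence $x_1, \dots, x_L$ with the probability $p(x_1, \dots, x_L)$ it assigns to that sequence.

A language model can also perform a completion given a prompt: the prompt is supplied as a prefix, and the model generates the tokens that follow.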
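To make scoring and completion concrete, here is a minimal sketch built on a toy bigram model. The table `BIGRAM`, its probabilities, and the helper names `sequence_log_prob` and `greedy_complete` are all hypothetical and chosen only for illustration; GPT-3 conditions on the whole preceding context, not just the previous token.

```python
import math

# Hypothetical toy bigram model: P(next token | previous token).
# "<s>" marks the start of a sequence; all numbers are made up.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.5, "cheese": 0.5},
    "mouse": {"ate": 1.0},
    "ate": {"the": 1.0},
}

def sequence_log_prob(tokens):
    """Score a sequence: sum of log P(token | previous token)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        p = BIGRAM.get(prev, {}).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")  # the model puts no mass on this sequence
        log_p += math.log(p)
        prev = tok
    return log_p

def greedy_complete(prompt, max_new_tokens=3):
    """Complete a prompt by repeatedly appending the most likely next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        prev = tokens[-1] if tokens else "<s>"
        options = BIGRAM.get(prev)
        if not options:
            break  # no known continuation for this token
        tokens.append(max(options, key=options.get))
    return tokens

print(sequence_log_prob(["the", "mouse", "ate", "the", "cheese"]))
print(greedy_complete(["the", "mouse"], max_new_tokens=2))
```

Scoring works in log space because the raw probability of a long sequence quickly becomes too small to represent comfortably.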

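We define a task as a mapping from inputs to outputs. For example, for question answering we might have:

Input: What school did Burne Hogarth establish?
Output: School of Visual Arts

We use the term adaptation to refer to the process of taking a language model and applying it to a downstream task, given:

a natural language description of the task, and
a set of training instances (input-output pairs).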

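There are two main ways to perform adaptation:

Training: train a new model that maps inputs to outputs, by one of the following:

creating a new model that uses the language model as features,
fine-tuning the base language model, or
something in between (lightweight fine-tuning).

Prompting (in-context learning): construct a prompt (a string built from the task description and training instances), or a set of prompts, and feed it to the language model. Depending on the number of training instances in the prompt, this is called:

Zero-shot: no training examples
One-shot: one training example
Few-shot: a handful of training examples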
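The difference between zero-shot, one-shot, and few-shot prompting comes down to how many solved examples are placed in the prompt. The sketch below illustrates this; the function name `build_prompt` and the "Input:/Output:" layout are assumptions made for illustration, not a format prescribed by the GPT-3 paper.

```python
def build_prompt(description, examples, query, k):
    """Build a prompt with k in-context examples (k=0 is zero-shot)."""
    blocks = [description] if description else []
    for inp, out in examples[:k]:
        blocks.append(f"Input: {inp}\nOutput: {out}")
    # The final block is left open so the model's completion is the answer.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

examples = [
    ("What school did Burne Hogarth establish?", "School of Visual Arts"),
]
print(build_prompt("Answer the question.", examples, "Who painted the Mona Lisa?", k=1))
```

Setting `k=0` gives a zero-shot prompt, `k=1` a one-shot prompt, and larger `k` a few-shot prompt, all without changing any model parameters.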

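Which approach should we choose?

Training can be difficult because of overfitting (imagine fine-tuning a 175-billion-parameter model on 5 examples).
For now, we adapt GPT-3 through prompting. The limitation is that only a small number of training instances fit in the prompt, because GPT-3's Transformer context window is limited to 2048 tokens.

Newer models have since raised this token limit.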
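The token limit determines how many in-context examples a prompt can hold. As a rough sketch, the hypothetical helper `pack_examples` below keeps adding examples until the budget would be exceeded. Counting whitespace-separated words is only a crude stand-in assumed here for illustration; GPT-3 counts byte-pair-encoded tokens, so a real tokenizer should be passed as `count_tokens`.

```python
def pack_examples(examples, query, max_tokens=2048, count_tokens=None):
    """Keep leading examples while the whole prompt stays within max_tokens."""
    if count_tokens is None:
        # Assumption: whitespace word count as a crude stand-in for BPE tokens.
        count_tokens = lambda text: len(text.split())
    used = count_tokens(query)  # the query itself must always fit
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if used + cost > max_tokens:
            break  # adding this example would exceed the context window
        kept.append(example)
        used += cost
    return kept

demo = ["Q: a b\nA: c", "Q: d e f\nA: g", "Q: h\nA: i"]
print(pack_examples(demo, "Q: what now?\nA:", max_tokens=12))
```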

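The GPT-3 paper evaluates the model on a large number of tasks. The sections below cover a selection of them using the following template:

Definition: what is the task, and what is its motivation?
Adaptation: how do we reduce the task to prompting?
Results: how does GPT-3 compare with task-specific state-of-the-art models?

Goals of this chapter:

an overview of NLP tasks (independent of large language models),
a sense of how well GPT-3 works, and
a first look at prompt engineering.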

2. Language modeling

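The most natural starting point for thinking about what a language model can do is to ask whether it can do what language models are supposed to do: model language.

Recall the definition of a language model from the previous chapter. For a model trained on a corpus, we can ask: what probability does the language model assign to a given sequence of tokens?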

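3. Perplexity

Perplexity can be read as the average branching factor a language model faces when predicting each token; it is used to evaluate the model's predictive ability.

The joint probability of a sequence, $p(x_{1:L}) = \prod_{i=1}^{L} p(x_i \mid x_{1:i-1})$, tends to zero as the sequence length $L$ grows, which makes it hard to track.

Perplexity instead measures the average uncertainty per token: $\text{perplexity}_p(x_{1:L}) = \exp\left(\frac{1}{L} \sum_{i=1}^{L} \log \frac{1}{p(x_i \mid x_{1:i-1})}\right)$. Here $\log \frac{1}{p(x_i \mid x_{1:i-1})}$ is a code length, and exponentiating the average code length gives the number of possibilities. For example, a code of length 3 bits can encode $2^3 = 8$ possible strings.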
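As a small worked example, perplexity can be computed directly from the probabilities a model assigned to each observed token. The function name `perplexity` and the input format (a list of per-token conditional probabilities) are assumptions for this sketch. It also shows why the joint probability is awkward to track while perplexity stays on a readable scale.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        return float("inf")  # a zero-probability token makes perplexity infinite
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always spreads its mass uniformly over 8 choices.
probs = [1 / 8] * 100
print(math.prod(probs))   # joint probability: vanishingly small
print(perplexity(probs))  # perplexity: about 8, the branching factor
```

Averaging in log space before exponentiating is what keeps the metric independent of sequence length.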

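Perplexity responds very differently to two kinds of error:

1. Recall Error

If the language model assigns a probability close to zero to a token that actually occurs, the perplexity blows up toward infinity.

2. Precision Error

Suppose a language model $p$ is mixed with a junk distribution $r$ with probability $\epsilon$, giving $q(x_i \mid x_{1:i-1}) = (1 - \epsilon)\, p(x_i \mid x_{1:i-1}) + \epsilon\, r(x_i \mid x_{1:i-1})$. Then

$\text{perplexity}_q(x_{1:L}) \le \frac{1}{1 - \epsilon}\, \text{perplexity}_p(x_{1:L}) \approx (1 + \epsilon)\, \text{perplexity}_p(x_{1:L})$.

In other words, mixing in 5% junk data raises the perplexity by only about 5%, even though on average one in every 20 generated tokens will then be junk.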
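The mild effect of junk on perplexity can be checked numerically. In the sketch below, a model's per-token probabilities are mixed with a junk distribution at rate `eps`; the helper names and the example probabilities are made up for illustration. The worst case is junk that assigns zero probability to every true token, which scales each probability by exactly `1 - eps`.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def mix_with_junk(token_probs, junk_probs, eps):
    """Probability of each true token under (1 - eps) * p + eps * r."""
    return [(1 - eps) * p + eps * r for p, r in zip(token_probs, junk_probs)]

p = [0.5, 0.25, 0.8, 0.1]     # hypothetical probabilities of the true tokens
worst_junk = [0.0] * len(p)   # junk never proposes the true token
q = mix_with_junk(p, worst_junk, eps=0.05)

ratio = perplexity(q) / perplexity(p)
print(ratio)  # close to 1 / 0.95, i.e. roughly a 5% increase
```

So a model that emits a junk token about once every 20 tokens is barely penalized by perplexity, which is why perplexity alone is not a measure of generation quality.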

4. Penn Tree Bank

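Penn Tree Bank:
https://catalog.ldc.upenn.edu/LDC99T42

The Penn Tree Bank is a classic NLP dataset, originally annotated for syntactic parsing. Starting with Emami and Jelinek (2004) and Mikolov and Zweig (2012), a version of the dataset containing only Wall Street Journal articles has been used as a language modeling benchmark.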

1. Emami, Jelinek(2004)
https://ieeexplore.ieee.org/document/1325968

2. Mikolov, Zweig (2012)
https://ieeexplore.ieee.org/document/6424228

Adaptation: feed the entire text to GPT-3 as a prompt and evaluate the perplexity:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

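Result: GPT-3 far outperforms the existing state of the art (perplexity, lower is better):

GPT-3: 20.5
BERT-Large-CAs1: 31.3

Caveat: because GPT-3's training data includes Wikipedia, it is hard to check whether your test data appeared in the training data and was memorized.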

5. LAMBADA (Paperno et al. 2016)

link: https://arxiv.org/pdf/1606.06031.pdf

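Task: predict the last word of a sentence.

Motivation: solving the task requires modeling long-range dependencies.

Definition: the language model is asked to complete a passage by supplying the final word of its last sentence.

Problem: the language model does not know that it should produce only the final word of the sentence.

Solution: frame the task more explicitly as an input-output mapping and use in-context learning with additional examples: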

Fill in blank:

Alice was friends with Bob. Alice went to visit her friend ___. -> Bob

She held the torch in front of her. She caught her breath. “Chris? There’s a step.” “What?” “A step. Cut in the rock. About fifty feet ahead.” She moved faster. They both moved faster. “In fact,” she said, raising the torch higher, “there’s more than a ___. -> step
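The fill-in-the-blank framing above can be assembled programmatically. This is a sketch under assumptions: the helper names `lambada_prompt` and `first_word` are hypothetical, and taking the first whitespace-delimited word of the completion is one simple way to read off the prediction, not necessarily the procedure used in the GPT-3 paper.

```python
def lambada_prompt(solved, passage):
    """solved: list of (passage_without_last_word, last_word) pairs."""
    blocks = ["Fill in blank:"]
    for text, word in solved:
        blocks.append(f"{text} ___. -> {word}")
    # The final passage is left unanswered for the model to complete.
    blocks.append(f"{passage} ___. ->")
    return "\n\n".join(blocks)

def first_word(completion):
    """Take the first whitespace-delimited word of the model's completion."""
    parts = completion.strip().split()
    return parts[0].strip(".,!?\"'") if parts else ""

solved = [("Alice was friends with Bob. Alice went to visit her friend", "Bob")]
print(lambada_prompt(solved, "\"In fact,\" she said, \"there's more than a"))
print(first_word(" step.\n\nShe moved on."))
```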

Result: on this task, GPT-3 achieves a much lower perplexity than the previous state of the art (which was based on GPT-2):

GPT-3 (few-shot): 1.92
SOTA: 8.63

6. Question answering

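Consider (closed-book) question answering, where the input is a question and the output is an answer. The language model has to somehow "know" the answer without looking up information in a database or a set of documents.

Input: What school did Burne Hogarth establish?
Output: School of Visual Arts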

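7. TriviaQA (Joshi et al. 2017)

link: https://arxiv.org/pdf/1705.03551.pdf

Task: given a trivia question, generate the answer.

Data: the original dataset was collected from trivia enthusiasts.

Adaptation: we define a prompt from the training instances and the question, and take the completion as the predicted answer: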

Q: ‘Nude Descending A Staircase’ is perhaps the most famous painting by which 20th century artist?

A: Marcel Duchamp

Result (Accuracy):

RAG: 68.0
GPT-3 (zero-shot): 64.3
GPT-3 (few-shot): 71.2

Increasing the model size and increasing the number of in-context training instances both improve GPT-3's performance on TriviaQA.
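Accuracy figures like these come from comparing each generated answer with a reference answer. The sketch below shows a simplified exact-match scorer; the normalization rules and function names are assumptions for illustration. The official TriviaQA evaluation also accepts answer aliases, which this sketch leaves out.

```python
import string

def normalize(answer):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())

def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to the reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

predictions = ["Marcel Duchamp.", "Pablo Picasso"]
references = ["marcel duchamp", "Marcel Duchamp"]
print(exact_match_accuracy(predictions, references))
```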

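8. WebQuestions (Berant et al. 2013)

link: https://aclanthology.org/D13-1160.pdf

Task: answer questions.

Data: a dataset collected from Google search queries, originally created for question answering over knowledge bases.

Adaptation: the same prompt format as above: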

Q: What school did burne hogarth establish?

A: School of Visual Arts

Result (Accuracy):

RAG: 45.5
GPT-3 (zero-shot): 14.4
GPT-3 (few-shot): 41.5

9. NaturalQuestions

Task: answer questions.

Data: a dataset collected from Google search queries (with long-form answers).

Adaptation: the same prompt format as above:

Q: Who played tess on touched by an angel?

A: Delloreese Patricia Early (July 6, 1931 - November 19, 2017), known professionally as Della Reese.

Result (Accuracy):

RAG: 44.5
GPT-3 (zero-shot): 14.6
GPT-3 (few-shot): 29.9

10. Translation

Task: translate a sentence in a source language (e.g., German) into a sentence in a target language (e.g., English).

Data: WMT’14 and WMT’16 datasets.

WMT'14 https://paperswithcode.com/dataset/wmt-2014
WMT'16 https://paperswithcode.com/dataset/wmt-2016

Adaptation: Few-shot

Mein Haus liegt auf dem Hügel. = My house is on the hill.

Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden = In no case may they be used for commercial purposes.

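Result (BLEU, German to English):

SOTA (supervised): 40.2
GPT-3 (zero-shot): 27.2
GPT-3 (few-shot): 40.6

Analysis:

Even without supervised training data, GPT-3 matches the state of the art of a fully supervised system!
This gives a lower bound on how well one can do in machine translation; you would certainly want to make use of a large parallel corpus (aligned input-output pairs).
Results for French and Romanian are similar.
Results from English into a foreign language are much worse, which is expected since GPT-3 is primarily an English language model.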

11. Arithmetic

Task: do arithmetic (addition, subtraction, and multiplication on 2- to 5-digit numbers).

Adaptation: pose the problem as question answering:

Q: What is 556 plus 497?

A: 1053

Result (Accuracy): it is hard to claim that GPT-3 truly "understands arithmetic", but it works surprisingly well.
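An arithmetic evaluation of this kind can be sketched as follows: generate random problems in the question-answering format shown above, then compare the model's completion with the true sum. The helper names are hypothetical, only addition is covered, and the model call itself is left out because it depends on the API being used.

```python
import random

def make_addition_problem(digits, rng):
    """Return (prompt, answer) for a random sum of two `digits`-digit numbers."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"Q: What is {a} plus {b}?\n\nA:", str(a + b)

def is_correct(completion, answer):
    """Compare the first token of the completion with the expected answer."""
    parts = completion.strip().split()
    # Thousands separators such as "1,053" are tolerated.
    return bool(parts) and parts[0].replace(",", "") == answer

rng = random.Random(0)  # fixed seed so the problems are reproducible
prompt, answer = make_addition_problem(3, rng)
print(prompt, answer)
print(is_correct(" 1053", "1053"))
```

Accuracy is then the fraction of problems for which `is_correct` returns True.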

12. News article generation

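Task: given a title and a subtitle, generate a news article.

Data: titles and subtitles are taken from newser.com.

Evaluation: humans rated each article by how likely it was to have been written by a machine.

Adaptation: the prompt needs to include in-context examples.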

Title: United Methodists Agree to Historic Split

Subtitle: Those who oppose gay marriage will form their own denomination

Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be "theologically and socially conservative," according to The Washington Post. The majority of delegates attending the church's annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination...

Result (Accuracy): humans correctly told "human" and "machine" articles apart only 52% of the time, barely above chance.

13. Novel tasks

Task: given a new made-up word and its definition, generate a sentence that uses the word.

Adaptation:

To “screeg” something is to swing a sword at it. An example of a sentence that uses the word screeg is: We screeged the tree with our swords.

14. Correcting English grammar

Task: given an ungrammatical sentence, correct its grammar.

Adaptation: the prompt consists of input-output pairs:

Poor English input: I eated the purple berries.

Good English output: I ate the purple berries.

Poor English input: Thank you for picking me as your designer. I’d appreciate it.

Good English output: Thank you for choosing me as your designer. I appreciate it.

Poor English input: I’d be more than happy to work with you in another project.

Good English output: I would be happy to work with you on another project.

15. Other Tasks

Benchmarks

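SWORDS: lexical substitution, where the goal is to predict synonyms of a word in the context of a sentence.
Massive Multitask Language Understanding: 57 multiple-choice tasks spanning mathematics, US history, computer science, law, and more.
TruthfulQA: a question answering dataset of questions that humans tend to answer incorrectly because of misconceptions.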

1. SWORDS:
https://arxiv.org/pdf/2106.04102.pdf

2. Massive Multitask Language Understanding
https://arxiv.org/pdf/2009.03300.pdf

3. TruthfulQA
https://arxiv.org/pdf/2109.07958.pdf

Examples on the OpenAI website:
https://beta.openai.com/examples/
gpt3demo.com:
https://gpt3demo.com/

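16. Summary

GPT-3 was evaluated on a wide range of standard NLP benchmarks and their tasks.
GPT-3 can perform extremely well or be very mediocre.
Increasing both the size of the model and the number of examples helps performance.
There are a few heuristic ways of adapting the language model to the task of interest.
Why does GPT-3 work so surprisingly well? No one knows.

17. Further Reading

Language Models are Few-Shot Learners (NeurIPS 2020): https://arxiv.org/pdf/2005.14165.pdf
Blog post explaining perplexity: https://towardsdatascience.com/perplexity-in-language-models-87a196019a94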